Abstract:
The memory wall has motivated many enhancements to cache management policies aimed at reducing misses. Cache compression has been proposed to increase effective cache capacity, which potentially reduces capacity and conflict misses. However, complexity in cache compression implementations could increase cache power and access latency. On the other hand, advanced cache replacement mechanisms use heuristics to reduce misses, leading to significant performance gains. Both cache compression and replacement policies should collaborate to improve performance. In this paper, we demonstrate that cache compression and replacement policies can interact negatively. In many workloads, performance gains from replacement policies are lost due to the need to alter the replacement policy to accommodate compression. This leads to sub-optimal replacement policies that could lose performance compared to an uncompressed cache. We introduce a novel, opportunistic cache compression mechanism, Base-Victim, based on an efficient cache design. Our compression architecture improves performance on top of advanced cache replacement policies, and guarantees a hit rate at least as high as that of an uncompressed cache. For cache-sensitive applications, Base-Victim achieves an average 7.3% performance gain for single-threaded workloads, and 8.7% gain for four-thread multi-program workload mixes.
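As an illustration of the guarantee mentioned in this abstract, the following minimal Python sketch (an illustration only, not the paper's hardware design, with plain LRU assumed as the baseline policy) keeps exactly the blocks an uncompressed cache would hold as "base" entries and uses the space freed by compression only for "victim" entries that are dropped first, so no baseline hit is ever lost:

    from collections import OrderedDict

    class BaseVictimSet:
        """Sketch: base entries mirror an uncompressed LRU set; victims are extras."""
        def __init__(self, num_ways, victim_slots):
            self.num_ways = num_ways          # associativity of the uncompressed baseline
            self.victim_slots = victim_slots  # extra entries made available by compression
            self.base = OrderedDict()         # tag -> data, LRU order (baseline policy)
            self.victims = OrderedDict()      # tag -> data, opportunistic extras

        def access(self, tag, data=None):
            if tag in self.base:              # baseline hit: update recency as LRU would
                self.base.move_to_end(tag)
                return True
            if tag in self.victims:           # extra hit made possible by compression
                data = self.victims.pop(tag)
                self._insert_base(tag, data)
                return True
            self._insert_base(tag, data)      # miss: fill exactly as the baseline would
            return False

        def _insert_base(self, tag, data):
            if len(self.base) >= self.num_ways:
                old_tag, old_data = self.base.popitem(last=False)  # baseline LRU victim
                self.victims[old_tag] = old_data                   # retain it opportunistically
                while len(self.victims) > self.victim_slots:
                    self.victims.popitem(last=False)               # victims are evicted first
            self.base[tag] = data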
Abstract:
A practical way to increase the effective capacity of a microprocessor's cache, without physically increasing the cache size, is to employ data compression. Last-Level Caches (LLC) are particularly amenable to such compression schemes, since the primary purpose of the LLC is to minimize the miss rate, i.e., it directly benefits from a larger logical capacity. In compressed LLCs, the cacheline size varies depending on the achieved compression ratio. Our observations indicate that this size information gives useful hints when managing the cache (e.g., when selecting a victim), which can lead to increased cache performance. However, there are currently no replacement policies tailored to compressed LLCs; existing techniques focus primarily on locality information. This article introduces the concept of size-aware cache management as a way to maximize the performance of compressed caches. Upon analyzing the benefits of considering size information in the management of compressed caches, we propose a novel mechanism, called Effective Capacity Maximizer (ECM), to further improve the performance and reduce the energy consumption of compressed LLCs. The proposed technique revolves around four fundamental principles: ECM Insertion (ECM-I), ECM Promotion (ECM-P), ECM Eviction Scheduling (ECM-ES), and ECM Replacement (ECM-R). Extensive simulations with memory traces from real applications running on a full-system simulator demonstrate significant improvements compared to compressed cache schemes employing conventional locality-aware cache replacement policies. Specifically, our ECM shows an average effective capacity increase of 18.4 percent over the Least-Recently Used (LRU) policy, and 23.9 percent over the Dynamic Re-Reference Interval Prediction (DRRIP) scheme. This translates into average system performance improvements of 7.2 percent over LRU and 4.2 percent over DRRIP. Moreover, the average energy consumption is also reduced by 5.9 percent over LRU and 3.8 percent over DRRIP.
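A hedged sketch of what size-aware victim selection can look like (illustrative only; the RRIP-style staleness counter and the tie-breaking rule are assumptions, not ECM's exact heuristic):

    def choose_victim(lines):
        """lines: list of dicts with 'rrpv' (re-reference prediction value,
        larger = predicted to be re-used further away) and 'size' (compressed bytes)."""
        max_rrpv = max(line["rrpv"] for line in lines)
        stale = [line for line in lines if line["rrpv"] == max_rrpv]
        # Among the stalest candidates, evict the largest line to free the most space.
        return max(stale, key=lambda line: line["size"])

    # Example: two equally stale lines; the poorly compressed 64-byte one is evicted.
    victim = choose_victim([
        {"tag": 0xA, "rrpv": 3, "size": 16},
        {"tag": 0xB, "rrpv": 3, "size": 64},
        {"tag": 0xC, "rrpv": 1, "size": 32},
    ])
    assert victim["tag"] == 0xB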
Abstract:
Cache memory plays a crucial role in determining the performance of processors, especially for embedded processors where area and power are tightly constrained. Effective management mechanisms, such as cache replacement policies, are necessary because modern embedded processors require not only efficient power consumption but also high performance. Practical cache replacement algorithms have focused on supporting the increasing data needs of processors. The commonly used Least Recently Used (LRU) replacement policy always predicts a near-immediate re-reference interval; hence, applications that exhibit a distant re-reference interval may perform poorly under the LRU replacement policy. In addition, recent studies have shown that the performance gap between LRU and theoretical optimal replacement (OPT) is large for highly associative caches. The LRU policy also performs poorly on memory-intensive workloads whose working set is greater than the available cache size. These reasons motivate the design of alternative replacement algorithms to improve cache performance. This paper explores a low-overhead, high-performance cache replacement policy for embedded processors that builds on the mechanism of LRU replacement. Experiments indicate that the proposed policy can significantly improve performance and miss rate for large, highly associative last-level caches. The proposed policy is based on the tag-distance correlation among cache lines in a cache set. Rather than always replacing the LRU line, the victim is chosen by considering the LRU-behavior bit of the line combined with the correlation between the tags of the cache lines in the set and the requested block's tag. By using the LRU-behavior bit, the LRU line is given a chance to reside longer in the set instead of being replaced immediately. Simulations with an out-of-order superscalar processor and memory-intensive benchmarks demonstrate that the proposed cache replacement algorithm can increase overall performance by 5.15% and reduce the miss rate by an average of 11.41%. (C) 2015 Elsevier B.V. All rights reserved.
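The sketch below illustrates the second-chance idea described in this abstract; the exact tag-correlation metric is not specified in the abstract, so absolute tag distance is used here purely as a placeholder:

    def select_victim(set_lines, requested_tag):
        """set_lines: list of dicts ordered from LRU to MRU, each with 'tag' (int)
        and 'lru_behavior' (second-chance bit). Assumes associativity > 1."""
        lru_line = set_lines[0]
        if not lru_line["lru_behavior"]:
            return lru_line                      # plain LRU replacement
        lru_line["lru_behavior"] = False         # consume the LRU line's second chance
        # Fall back to the line whose tag is least correlated with the requested tag
        # (placeholder metric: largest absolute tag distance).
        return max(set_lines[1:], key=lambda line: abs(line["tag"] - requested_tag))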
Abstract:
As caches become larger and shared by an increasing number of cores, cache management is becoming more important. This paper explores collaborative caching, which uses software hints to influence hardware caching. Recent studies have shown that such collaboration between software and hardware can theoretically achieve optimal cache replacement on an LRU-like cache. This paper presents Pacman, a practical solution for collaborative caching in loop-based code. Pacman uses profiling to analyze the patterns of an optimal caching policy in order to determine which data to cache and at what time. It then splits each loop into different parts at compile time. At run time, the loop boundary is adjusted to selectively store data that would be stored under an optimal policy. In this way, Pacman emulates the optimal policy wherever it can. Pacman requires only a single bit in load and store instructions, and some current hardware already provides partial support. This paper presents results using both simulated and real systems, and compares the simulated results with related caching policies.
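A minimal sketch of how a one-bit software hint can steer an LRU-like cache, in the spirit of the collaborative caching described here (an illustration only, not Pacman's actual hardware or compiler pass): hinted accesses are parked at the LRU end so they are evicted first, emulating what an optimal policy would do for data with no further reuse.

    from collections import OrderedDict

    class HintedLRUCache:
        def __init__(self, capacity):
            self.capacity = capacity
            self.lines = OrderedDict()            # ordered from LRU to MRU

        def access(self, addr, evict_me_hint=False):
            hit = addr in self.lines
            if hit:
                del self.lines[addr]              # will re-insert at MRU below
            elif len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)    # evict the LRU line
            self.lines[addr] = None
            if evict_me_hint:
                self.lines.move_to_end(addr, last=False)  # park at the LRU position
            return hit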
Abstract:
This paper presents a cache replacement policy that has been developed specifically for efficient media caching in streaming media cache servers. For efficient media caching, the proposed policy takes into account the periodic patterns of users' requests in addition to parameters such as reference count, the amount of media content delivered to clients, and reference time. These values are collected at run time for each cached object. In order to adequately and promptly adapt to the changing characteristics of user preferences, the policy introduces the concept of a weighted window for replacement, in which higher priorities are given to more recently referenced media content, which is consequently less likely to be replaced. We present and analyze simulation results showing that the proposed policy outperforms conventional replacement policies such as LRU, LFU, and SEG in terms of hit ratio, byte-hit ratio, delayed start, and cache input. (c) 2005 Elsevier B.V. All rights reserved.
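A rough sketch of a weighted-window style priority; the formula, window length, and weights below are illustrative assumptions, since the abstract does not give the exact function:

    import time

    def priority(obj, now=None, window=3600.0, w_recent=2.0, w_old=1.0, w_bytes=1e-6):
        """obj: dict with 'ref_times' (list of reference timestamps) and 'bytes_delivered'.
        References inside the recent window weigh more, so fresh content ranks higher."""
        now = time.time() if now is None else now
        recent = sum(1 for t in obj["ref_times"] if now - t <= window)
        old = len(obj["ref_times"]) - recent
        return w_recent * recent + w_old * old + w_bytes * obj["bytes_delivered"]

    def choose_replacement_victim(cached_objects, now=None):
        # The lowest-priority object is replaced first.
        return min(cached_objects, key=lambda obj: priority(obj, now))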
Abstract:
Measurement-Based Probabilistic Timing Analysis (MBPTA) facilitates the analysis of complex software running on hardware with high-performance features. MBPTA also aims at avoiding additional costs for timing analysis techniques and at preserving confidence in the derived WCET (Worst-Case Execution Time) estimates. Cache behavior has a deep influence on WCET estimates and hence on "the amount of software" that can be consolidated onto a single hardware platform. Deterministic replacement policies such as LRU (Least Recently Used) and NMRU (Non-Most Recently Used) have systematic pathological cases that may lead to high execution times and WCET estimates. Instead, random replacement (RR) decreases the probability of pathological cases, at the cost of exploiting less temporal locality.
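The small simulation below reproduces the kind of pathological case alluded to above: cyclically touching one block more than the associativity gives a 0% hit rate under LRU, while random replacement still hits with some probability on every pass (parameter values are illustrative):

    import random

    def run(policy, assoc=4, blocks=5, accesses=10_000):
        ways, hits = [], 0                        # ways[0] is the LRU position
        for i in range(accesses):
            tag = i % blocks                      # cyclic pattern over assoc + 1 blocks
            if tag in ways:
                hits += 1
                if policy == "lru":
                    ways.remove(tag)
                    ways.append(tag)              # refresh recency on a hit
            else:
                if len(ways) >= assoc:
                    victim = 0 if policy == "lru" else random.randrange(assoc)
                    ways.pop(victim)
                ways.append(tag)
        return hits / accesses

    print("LRU hit rate:   ", run("lru"))         # 0.0, the systematic pathological case
    print("Random hit rate:", run("random"))      # > 0, varies from run to run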
Abstract:
A multicore processor is a single processor that contains a number of cores on a chip. The cores are functional units made up of computation units and caches. In multicore systems, cache allocation technology helps address shared-resource concerns by providing software control over where data is allocated in the cache, enabling isolation and prioritization of key applications. In this paper, cache replacement policies and cache allocation technology are discussed. Cache allocation technology sustains processor performance and protects against timing attacks, while replacement policies determine which data blocks should be removed from the cache when a new data block is added. We also analyze the performance of L1 instruction and data caches under different replacement policies, such as LRU (Least Recently Used), FIFO (First In First Out), RANDOM, and PLRU (Pseudo Least Recently Used).
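For reference, a sketch of tree-based PLRU for a 4-way set, one of the policies compared in this paper (a textbook formulation, not tied to the paper's experimental setup):

    class TreePLRU4:
        """3-bit tree PLRU for one 4-way set; bit value 0 means 'victim is on the left'."""
        def __init__(self):
            self.bits = [0, 0, 0]                 # [root, left pair, right pair]

        def touch(self, way):
            # Point every bit on the path *away* from the accessed way.
            self.bits[0] = 1 if way < 2 else 0
            if way < 2:
                self.bits[1] = 1 if way == 0 else 0
            else:
                self.bits[2] = 1 if way == 2 else 0

        def victim(self):
            # Follow the bits toward the least recently touched leaf.
            if self.bits[0] == 0:
                return 0 if self.bits[1] == 0 else 1
            return 2 if self.bits[2] == 0 else 3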
Abstract:
For those cache hierarchy levels where program locality is not as evident as in L1, LRU replacement does not seem to be the optimal way to determine which blocks will be requested soon. The literature is prolific on alternative reuse-distance estimations at the last on-chip cache level, proving the difficulty of achieving an optimal hit rate. One of the key aspects for performance is knowing inter- and intra-application reuse-distance variability. Many solutions already do this, but most of them rely on a simple choice among a few alternative policies. The experiments performed to motivate the proposal confirm application variability, but also show that the behavior of applications is much more than bimodal. This means that there is a performance gap that current hybrid policies are not able to cover. In this paper we propose a mobile insertion position replacement policy (MIP), which combines well-known LRU ordering and promotion policies with a completely adaptive insertion mechanism. The dynamic behavior of insertion is able to capture hit-rate variability in a more accurate way. Making use of set dueling and dynamic set sampling for prediction, our mechanism continuously estimates the insertion position that maximizes the cache hit rate. The hardware overhead compared to an LRU replacement algorithm is merely three 3-bit saturating counters per LLC bank. Our experiments show that for a wide range of applications, MIP is able to improve the hit rate of LRU by 30% on average. MIP outperforms current state-of-the-art replacement policies of similar implementation cost by 10% on average, and by 20% in single-thread or multi-thread workloads. (C) 2015 Elsevier B.V. All rights reserved.
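A rough software sketch of an adaptive insertion position driven by set dueling (an illustration of the idea, not MIP's exact hardware; the counter width, the one-position step, and the two-group duel are assumptions):

    class InsertionPositionDuel:
        """Two groups of sampled sets insert at neighbouring positions; a saturating
        counter records which group misses less, and the shared position drifts there."""
        def __init__(self, associativity, counter_bits=3):
            self.assoc = associativity
            self.position = associativity // 2    # insertion position used by follower sets
            self.counter = 0                      # signed saturating counter
            self.limit = 2 ** (counter_bits - 1)

        def record_miss(self, sampled_group):
            # Group 'low' inserts at position - 1, group 'high' at position + 1.
            delta = -1 if sampled_group == "low" else +1
            self.counter = max(-self.limit, min(self.limit - 1, self.counter + delta))

        def update_position(self):
            # If the 'high' sets miss more, move the insertion point lower, and vice versa.
            if self.counter >= self.limit - 1 and self.position > 0:
                self.position -= 1
            elif self.counter <= -self.limit and self.position < self.assoc - 1:
                self.position += 1
            self.counter = 0
            return self.position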
Abstract:
In scalable CC-NUMA multiprocessors, it is crucial to reduce the average memory access time. For applications where the second-level (L2) cache is large enough, we propose a split L2 cache to utilize the surplus space. The split L2 cache is composed of a traditional LRU cache and an RVC (Remote Victim Cache), which stores only data from the remote memory address range. Thus, it reduces the average L2 cache miss time by keeping remote blocks that would otherwise be discarded. Though the split cache does not reduce the miss rate, it is observed to reduce the total execution time by up to 27%. It even outperforms an LRU cache of double the size.
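A minimal sketch of the split-L2 lookup path described above; it captures the structure only, and the is_remote_addr predicate plus the choice not to promote RVC hits back into the main portion are simplifying assumptions:

    from collections import OrderedDict

    class SimpleLRU:
        def __init__(self, capacity):
            self.capacity, self.lines = capacity, OrderedDict()
        def lookup(self, addr):
            if addr in self.lines:
                self.lines.move_to_end(addr)
                return True
            return False
        def fill(self, addr):
            victim = None
            if len(self.lines) >= self.capacity:
                victim, _ = self.lines.popitem(last=False)
            self.lines[addr] = None
            return victim                          # block displaced by this fill, if any

    class SplitL2:
        """Sketch of the split L2: a conventional LRU part plus a remote victim cache."""
        def __init__(self, lru_capacity, rvc_capacity, is_remote_addr):
            self.lru = SimpleLRU(lru_capacity)
            self.rvc = SimpleLRU(rvc_capacity)     # holds victims from the remote range only
            self.is_remote = is_remote_addr        # predicate: does this address live on a remote node?
        def access(self, addr):
            if self.lru.lookup(addr) or self.rvc.lookup(addr):
                return True                        # hit in either portion
            victim = self.lru.fill(addr)           # miss: fill the main portion
            if victim is not None and self.is_remote(victim):
                self.rvc.fill(victim)              # keep remote victims, drop local ones
            return False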
Abstract:
Solid State Drives (SSDs) have been extensively deployed as the cache of hard disk-based storage systems. An SSD-based cache generally supplies ultra-large capacity, whereas managing so large a cache introduces excessive memory overhead, which in turn makes the SSD-based cache neither cost-effective nor energy-efficient. This work aims to reduce the memory overhead introduced by the replacement policy of an SSD-based cache. Traditionally, the data structures involved in a cache replacement policy reside in main memory. However, these in-memory data structures are no longer suitable for an SSD-based cache, since the cache is much larger than ever. We propose a memory-efficient framework that keeps most data structures in the SSD while leaving only a memory-efficient data structure (i.e., a new Bloom filter proposed in this work) in main memory. Our framework can be used to implement any LRU-based replacement policy with negligible memory overhead. We evaluate our proposals via theoretical analysis and a prototype implementation. Experimental results demonstrate that our framework is practical for implementing most replacement policies for large caches, and is able to reduce the memory overhead by about .
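A minimal sketch of the DRAM/SSD split described above, using a plain Bloom filter as a stand-in for the paper's proposed variant: the filter stays in main memory and screens lookups, so SSD-resident replacement metadata is consulted only for blocks that might actually be cached.

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits, num_hashes):
            self.bits = bytearray(num_bits // 8 + 1)
            self.num_bits, self.num_hashes = num_bits, num_hashes

        def _positions(self, key):
            # Derive num_hashes bit positions from salted SHA-256 digests.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, key):
            for pos in self._positions(key):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, key):
            return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))

    # Usage: probe the DRAM-resident filter before touching SSD-resident metadata.
    filter_in_dram = BloomFilter(num_bits=1 << 20, num_hashes=4)
    filter_in_dram.add("block:42")
    if filter_in_dram.might_contain("block:42"):
        pass  # only now read the on-SSD replacement metadata for this block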